Conditional Random Fields based Pronominal Resolution in Tamil
Abstract
This paper deals with Tamil pronominal resolution using Conditional Random Fields (CRFs), a machine learning approach. A detailed linguistic analysis of Tamil pronominals and their antecedents occurring in various syntactic constructs was carried out, which led to the selection of appropriate features for the CRF approach. The syntactic features thus identified allowed the system to learn the most frequently occurring pronoun-antecedent patterns from the training corpus. The performance of the system is highly encouraging.

Keywords: Anaphora; Antecedent; Pronominal; Anaphor; CRF++

I. INTRODUCTION

Pronominal resolution is a well-studied area for English, but not enough work has been done for Indian languages. Pronominal resolution is the process of finding the antecedent of a pronoun. In this paper we analyze the third person pronominals in Tamil: avan "he", aval "she" and atu "it". Consider the following examples.

Krishnan avanai_i maatRRinaan
Krishnan he+acc changed+3sgm
'Krishnan changed him.' (ex.1)

In ex.1, Krishnan is the subject and is a proper noun with the nominative case marker. Even so, it cannot be the antecedent of the anaphor avanai_i. Hence the pronoun avanai_i does not refer to Krishnan; it refers to somebody mentioned in the previous sentences.

Krishnan_i avanai_i maatRik koNtaan.
Krishnan he+acc change got+3sgm
'Krishnan changed himself.' (ex.2)

In ex.2, Krishnan is again the subject and a proper noun with the nominative case marker. Here the pronoun avanai_i, with the accusative case marker, refers to Krishnan because of the verb koNtaan (got+3sgm).

I. RELATED WORK

Approaches to anaphora resolution usually rely on a set of anaphora resolution factors. Frequently used factors include gender, person and number agreement, c-command constraints, semantic consistency, syntactic parallelism, semantic parallelism, salience and proximity.
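Two of the factors listed above, agreement and proximity, can be illustrated with a small sketch. This is not the CRF system described in this paper. It is a simple baseline, given only to show how such factors act as a filter and a ranking criterion. The `Mention` class, its fields and the token positions are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """A noun phrase or pronoun with hypothetical PNG annotations."""
    text: str
    person: int    # 1, 2 or 3
    number: str    # "sg" or "pl"
    gender: str    # "m", "f" or "n"
    position: int  # token index in the discourse (assumed to be given)


def agrees(pronoun: Mention, candidate: Mention) -> bool:
    """Agreement factor: person, number and gender must all match."""
    return (pronoun.person, pronoun.number, pronoun.gender) == (
        candidate.person, candidate.number, candidate.gender)


def resolve(pronoun: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Keep preceding candidates that agree, then pick the closest one."""
    viable = [c for c in candidates
              if c.position < pronoun.position and agrees(pronoun, c)]
    # Proximity factor: the candidate with the largest position is nearest.
    return max(viable, key=lambda c: c.position, default=None)


if __name__ == "__main__":
    candidates = [
        Mention("Krishnan", 3, "sg", "m", 0),
        Mention("pena", 3, "sg", "n", 3),
    ]
    avan = Mention("avan", 3, "sg", "m", 6)
    antecedent = resolve(avan, candidates)
    print(antecedent.text if antecedent else "no antecedent found")
```

A filter of this kind cannot separate ex.1 from ex.2, where agreement holds in both sentences and the verb decides the reading. This is one reason to learn from syntactic features instead.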
One of the early works in pronominal resolution is Hobbs' naive approach, which relies on syntactic information (Hobbs, J., 1978). Carter built a system on Wilks' common sense inference theory (Carter, D., 1987). Carbonell and Brown introduced an approach that combines multiple knowledge sources (Carbonell, J. G. & Brown, R. D., 1988).

The initial approaches were broadly classified as knowledge-poor and knowledge-rich. They include the syntax-based approach of Hobbs (the naive approach), centering theory based approaches (Joshi, A. K. & Kuhn, S., 1979; Joshi, A. K. & Weinstein, S., 1981) and factor/indicator based approaches such as Lappin and Leass' method, which identifies the antecedent using a set of salience factors and the weights associated with them. This approach requires deep syntactic analysis.

A.Akilandeswari et.al / International Journal on Computer Science and Engineering (IJCSE) ISSN : 0975-3397 Vol. 5 No. 06 Jun 2013 567

Ruslan Mitkov introduced two approaches based on a set of indicators, MOA (Mitkov's Original Approach) and MARS (Mitkov's Anaphora Resolution System) (Mitkov, R., 1998). These indicators return a value based on certain aspects of the context in which the anaphor and the possible antecedent occur. The return values range from -1 to 2. MOA does not make use of syntactic analysis, whereas MARS makes use of shallow dependency analysis.

Several coreference resolution systems are currently publicly available. JavaRap (Qiu et al., 2004) is an implementation of Lappin and Leass' (1994) Resolution of Anaphora Procedure (RAP). JavaRap resolves only pronouns and is thus not directly comparable to Reconcile. GuiTaR (Poesio and Kabadjov, 2004) and BART (Versley et al., 2008), which can be considered a successor of GuiTaR, are both modular systems that target the full coreference resolution task.
In addition, the architecture and system components of Reconcile (including a comprehensive set of features that draw on the expertise of state-of-the-art supervised learning approaches, such as Bengtson and Roth (2008)) result in performance closer to the state of the art. (See Johansson, C. (Ed.), Proceedings of the Second Workshop on Anaphora Resolution, 2008.)

A. Anaphora Resolution In Indian Languages

Some of the work done on anaphora resolution for Indian languages is as follows. VASISTH is a rule-based system that works with shallow parsing and exploits the rich morphology of Indian languages to identify the antecedents of anaphors (Sobha, L. & Patnaik, B. N., 1999). Work on pronominal resolution in Tamil using machine learning (Murthi, K. N. & Sobha, L., 2007) showed that the task may be feasible and that it depends on the reliability of language-specific features such as person, number, gender and case marking. Anaphora resolution in Tamil has also been studied using a machine learning technique, linear regression, and compared with salience factors.

Dhar worked on "A method for pronominal anaphora resolution in Bengali" (Dhar, A. & Garain, U., 2008). Sobha, L. & Pralayankar, P. (2008) worked on "Algorithm for Anaphor Resolution in Sanskrit". "Resolving Pronominal Anaphora in Hindi Using Hobbs' Algorithm" was done by Kamalesh Dutta. Sobha, L. (1999), "Anaphora Resolution In Malayalam and Hindi", is a doctoral dissertation submitted to Mahatma Gandhi University, Kottayam, Kerala.
In ICON 2011, an NLP tool contest on Anaphora Resolution for Indian Languages was held. The contest covered Bengali, Hindi, Odiya, Marathi and Tamil. For each language, the participants tried different methods.

The paper is organized as follows. The introduction is followed by the linguistic analysis of pronominals in Tamil, with examples and types of pronominals. In section 5 we discuss in detail the implementation of the system, which includes feature selection, CRFs and the pre-processing modules. The last section deals with the evaluation and results, followed by the conclusion.

II. AN OVERVIEW OF TAMIL LANGUAGE

Tamil belongs to the South Dravidian family of languages. It is morphologically rich and has predominantly free word order. Tamil sentences generally follow the subject-object-verb pattern, but interchanging subject and object is common. It has postpositions. It is a nominative-accusative language like the other Dravidian languages. Sentences are normally constructed with nominative subjects, although constructions with certain verbs require dative or possessive subjects. Tamil has PNG (person, number and gender) agreement. First and second person singular and plural pronouns are used deictically, though they are also used anaphorically in discourse. There are many approaches to pronominal resolution, such as rule-based, statistical and machine learning based approaches.

III. TYPES OF ANAPHORA

Pronominal anaphor: Pronominals in Tamil should agree in gender, number and person with their antecedents.

Krishnan_i viittiRkku vantaan. avan_i avanutaiya_i naarkaliyil amarntaan.
Krishnan home+dat came+3sgm. He he+genitive chair+loc sat+3sgm.
'Krishnan came home. He sat on his chair.' (ex.3)

In ex.3, there are two anaphors, avan_i and avanutaiya_i. The anaphor avanutaiya_i refers to avan_i, and avan_i refers to Krishnan_i.

Quantifier/Ordinal: The anaphor is a quantifier such as one, two etc. and has its own suffixes.

Krishnan oru putu Pena_i vaankinaan. Ramanum onrai_i vaankinaan.
Krishnan one new pen bought+3sgm. Raman+um one+acc bought+3sgm.
'Krishnan bought a new pen and Raman also bought one.' (ex.4)

In ex.4, onrai_i is a quantifier used as a pronoun, and it refers to Pena_i in the previous sentence.

Pleonastic anaphor: The pronoun 'atu' (it) refers to nothing in particular in the text.

atu_i oru maalai neram.
It one evening time
'It's an evening.' (ex.5)

In ex.5, atu_i is a pronoun that does not refer to anything in the text, so it is non-anaphoric.

Whole and the part anaphor: In this category, the anaphor refers to some real world knowledge that has not been mentioned anywhere earlier in the discourse.

Krishnan matikkaNani_i vaanki ullan. Inta iyantirattai_i avanal ella itattiRkkum etuttu cella mutikiratu.
Krishnan laptop bought be. This machine+acc he+ins all place+dat take go able
'Krishnan bought a laptop. He is able to take this machine everywhere.' (ex.6)

Here iyantirattai_i is the anaphor and matikkaNani_i is the antecedent. This type of anaphor needs world knowledge to resolve. This work focuses mainly on pronominal anaphors.

IV. TYPES OF PRONOMINAL ANAPHORA IN TAMIL

Pronominal anaphora in Tamil is further classified into personal anaphors, possessive anaphors, reflexive anaphors, demonstrative anaphors and relative anaphors. Most pronominal anaphora resolution algorithms only account for anaphors referring to individual entities. We have taken the anaphors avan, aval and atu for our work, as described below. Pronominal anaphora in Tamil is based on personal anaphors such as avan/he, avaL/she, atu/it and avarkaL/they, their.

A. Possessive Anaphora

The possessive anaphors end with morphemes such as utaiya, atu and in. The anaphors and their inflected forms are given below.
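The suffix pattern of possessive anaphors can be illustrated with a small sketch. It is not the morphological analyzer used in this work. It only shows how a transliterated token such as avanutaiya (ex.3) splits into a pronoun stem and a possessive morpheme. The function name, the stem list and its restriction to avan and aval are assumptions made for the illustration.

```python
from typing import Optional, Tuple

# Possessive morphemes named in the text.
POSSESSIVE_SUFFIXES = ("utaiya", "atu", "in")

# Pronoun stems assumed for this sketch; other stems are left out on purpose.
PRONOUN_STEMS = ("avan", "aval")


def split_possessive(token: str) -> Optional[Tuple[str, str]]:
    """Return (stem, suffix) if the token is pronoun stem + possessive suffix."""
    word = token.lower()  # treats the transliterations avaL and aval alike
    # Try longer suffixes first so a short suffix never shadows a long one.
    for suffix in sorted(POSSESSIVE_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in PRONOUN_STEMS:
                return stem, suffix
    return None


if __name__ == "__main__":
    for token in ("avanutaiya", "avaLutaiya", "avanin", "atu", "naarkaliyil"):
        print(token, "->", split_possessive(token))
```

The bare pronoun atu is not treated as a possessive here, because removing the suffix atu leaves an empty stem. A real system would take this decision from a full morphological analysis.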
Similar resources
Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields
In this paper, we present a domain focused Tamil Named Entity Recognizer for the tourism domain. This method takes care of morphological inflections of named entities (NE). It handles nested tagging of named entities with a hierarchical tagset containing 106 tags. The tagset is designed with a focus on the tourism domain. We have experimented building Conditional Random Field (CRF) models by training the...
Automatic Conversion of Dialectal Tamil Text to Standard Written Tamil Text using FSTs
We present an efficient method to automatically transform spoken language text to standard written language text for various dialects of Tamil. Our work is novel in that it explicitly addresses the problem and need for processing dialectal and spoken language Tamil. Written language equivalents for dialectal and spoken language forms are obtained using Finite State Transducers (FSTs) where spok...
Tamil NER – Coping with Real Time Challenges
This paper describes various challenges encountered while developing an automatic Named Entity Recognition (NER) using Conditional Random Fields (CRFs) for Tamil. We also discuss how we have overcome some of these challenges. Though most of the challenges in NER discussed here are common to many Indian languages, in this work the focus is on Tamil, a South Indian language belonging to Dravidian...
Classification of high resolution remote sensing image based on Geo-ontology and Conditional Random Fields
The availability of high spatial resolution remote sensing data provides new opportunities for urban land-cover classification. More geometric details can be observed in the high resolution remote sensing image. Ground objects in the high resolution remote sensing image also display rich texture, structure, shape and hierarchical semantic characteristics. More landscape elements are represent...
Multitemporal Crop Type Classification Using Conditional Random Fields and Rapideye Data
The task of crop type classification with multitemporal imagery is nowadays often done applying classifiers that are originally developed for single images like support vector machines (SVM). These approaches do not model temporal dependencies in an explicit way. Existing approaches that make use of temporal dependencies are in most cases quite simple and based on rules. Approaches that integra...